Take Home Exercise 02

Author

Huynh Minh Phuong

Overview

Oceanus has enlisted FishEye International’s assistance in identifying potentially illegal fishing companies. FishEye’s analysts were provided with import/export data for Oceanus’ marine and fishing sectors, but the data was incomplete. To aid their analysis, FishEye transformed the trade data into a knowledge graph to understand business relationships, specifically to combat illegal, unreported, and unregulated (IUU) fishing and protect affected marine species. While node-link diagrams provided a high-level overview of the knowledge graph, FishEye now seeks visualizations that offer more detailed patterns about entities within the graph. The analysis consists of two main parts.

Firstly, FishEye aims to visualize temporal patterns to determine if companies engaging in illegal fishing activities have reemerged under different names after shutting down. They seek assistance in comparing the activities of these companies over time.

Secondly, FishEye has employed various tools, including artificial intelligence, to reason on the knowledge graph and propose additional links to expand the dataset. They have presented 12 groups of link suggestions and require aid in evaluating the reliability of these tools for completing the graph. FishEye is particularly interested in identifying new temporal patterns or anomalies that arise when new links are added.

We aim to use visual analytics to help FishEye identify companies that may be engaged in illegal fishing.

Data set

The graph is structured in a JSON format designed to align with d3’s node-link format and ensure compatibility with networkx.node_link_graph. At the top level, it consists of a dictionary containing graph-level properties, indicated by keys such as “directed,” “multigraph,” and “graph.” The “nodes” and “links” keys each hold a dictionary of nodes and links, respectively.

For nodes, the entries must include a unique “id” key for each node. As for links, they consist of “source” and “target” keys, referring to the node id values. Any additional keys provided within the node and link dictionaries serve as attributes for those specific nodes or links.

Node Attributes:

  • id – Name of the company that originated (or received) the shipment

  • shpcountry – Country the company most often associated with when shipping

  • rcvcountry – Country the company most often associated with when receiving

  • dataset – Always ‘MC2’

Edge Attributes:

  • arrivaldate – Date the shipment arrived at port in YYYY-MM-DD format.

  • hscode – Harmonized System code for the shipment. Can be joined with the hscodes table to get additional details.

  • valueofgoods_omu – Customs-declared value of the total shipment, in Oceanus Monetary Units (OMU)

  • volumeteu – The volume of the shipment in ‘Twenty-foot equivalent units’, roughly how many 20-foot standard containers would be required. (Actual number of containers may have been different as there are 20ft and 40ft standard containers and tankers that do not use containers)

  • weightkg – The weight of the shipment in kilograms (if known)

  • dataset – Always ‘MC2’

  • type – Always ‘shipment’ for MC2

  • generated_by – Name of the program that generated the edge. (Only found on ‘bundle’ records.)

Load packages

The main packages for plotting graphs are: igraph, tidygraph, ggraph, visNetwork, sf, sfnetworks. The rest of the packages are for data wrangling.

Show code
pacman::p_load(igraph,tidygraph,ggraph,visNetwork,ggmap,tmap,ggplot2,tidyverse,graphlayouts,jsonlite,zoo,plotly)

Import data

Use fromJson from package jsonlite to import main graph data from json file format

Show code
MC2<-fromJSON("data/mc2_challenge_graph.json")

Extract node information for the main graph

We will ignore the dataset column and use select to select id, shipping country and receiving country information. Use glimpse to view the data.

Show code
MC2_nodes <- as_tibble(MC2$nodes) %>% 
  select(id, shpcountry,rcvcountry)

glimpse(MC2_nodes)
Rows: 34,576
Columns: 3
$ id         <chr> "AquaDelight Inc and Son's", "BaringoAmerica Marine Ges.m.b…
$ shpcountry <chr> "Polarinda", NA, "Oceanus", NA, "Oceanus", "Kondanovia", NA…
$ rcvcountry <chr> "Oceanus", NA, "Oceanus", NA, "Oceanus", "Utoporiana", NA, …

Use as.factor to convert the character data type to factor.

Show code
MC2_nodes$id <-as.factor(MC2_nodes$id)
MC2_nodes$shpcountry<-as.factor(MC2_nodes$shpcountry)
MC2_nodes$rcvcountry<-as.factor(MC2_nodes$rcvcountry)

summary(MC2_nodes)
       id             shpcountry           rcvcountry   
 -1     :    1   Marebak   : 3657   Oceanus     :29312  
 -10    :    1   Alverossia:  600   Coralmarica :  552  
 -100   :    1   Isliandor :  437   Sol y Oceana:  277  
 -1000  :    1   Merigrad  :  435   Marebak     :  151  
 -10000 :    1   Oceanus   :  418   Riodelsol   :   99  
 -10001 :    1   (Other)   : 6670   (Other)     : 1276  
 (Other):34570   NA's      :22359   NA's        : 2909  
Note

We observed many NA in the shpcountry and rcvcountry columns. Since we are unable to infer the values for these columns, we need to remove rows with NA values.

We use group_by to check for each company id how many different shipcountry and rcvcounty are available to identify the overall trend. Each company only ship to 1 receiving country and ship from 1 shipping country.

Show code
count<-MC2_nodes %>% 
  group_by(id) %>% 
  summarise(no_shpcountry=n_distinct(shpcountry),
            no_rcvcountry=n_distinct(rcvcountry)) 

summary(count)
       id        no_shpcountry no_rcvcountry
 -1     :    1   Min.   :1     Min.   :1    
 -10    :    1   1st Qu.:1     1st Qu.:1    
 -100   :    1   Median :1     Median :1    
 -1000  :    1   Mean   :1     Mean   :1    
 -10000 :    1   3rd Qu.:1     3rd Qu.:1    
 -10001 :    1   Max.   :1     Max.   :1    
 (Other):34570                              

Extract edge information for the main graph

We only considered edges information of nodes in MC2_nodes and used filter to filter out the edges for those nodes. We ignored the dataset and type column and used select to select the rest of the attributes of edges of the main graph. Used glimpse to view the data.

Show code
MC2_edges <-as_tibble(MC2$links) %>%
  select(source, target, arrivaldate,hscode,weightkg,volumeteu,valueofgoodsusd,valueofgoods_omu) %>% 
  distinct()

glimpse(MC2_edges)
Rows: 5,309,087
Columns: 8
$ source           <chr> "AquaDelight Inc and Son's", "AquaDelight Inc and Son…
$ target           <chr> "BaringoAmerica Marine Ges.m.b.H.", "BaringoAmerica M…
$ arrivaldate      <chr> "2034-02-12", "2034-03-13", "2028-02-07", "2028-02-23…
$ hscode           <chr> "630630", "630630", "470710", "470710", "470710", "47…
$ weightkg         <int> 4780, 6125, 10855, 11250, 11165, 11290, 9000, 19490, …
$ volumeteu        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ valueofgoodsusd  <dbl> NA, NA, NA, NA, NA, NA, 87110, 188140, NA, 221110, 58…
$ valueofgoods_omu <dbl> 141015, 141015, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
Note

Wrong data types for arrivaldate as well as other character type data need to be converted.

We used as.factor to convert source, target, hscode. mutate and ymd were used to convert arrival date data type to date and extract day of the week information from the date with wday

Show code
MC2_edges<-MC2_edges %>% 
  mutate(arrivaldate = ymd(arrivaldate)) %>% 
  mutate(weekday = wday(arrivaldate,
                        label = TRUE,
                        abbr = FALSE)) %>% 
  mutate(monthyear=as.yearmon(arrivaldate)) %>% 
  mutate(year=year(arrivaldate))%>% 
  filter(source!=target)

MC2_edges$source<-as.factor(MC2_edges$source)
MC2_edges$target<-as.factor(MC2_edges$target)
glimpse(MC2_edges)
Rows: 5,158,933
Columns: 11
$ source           <fct> AquaDelight Inc and Son's, AquaDelight Inc and Son's,…
$ target           <fct> BaringoAmerica Marine Ges.m.b.H., BaringoAmerica Mari…
$ arrivaldate      <date> 2034-02-12, 2034-03-13, 2028-02-07, 2028-02-23, 2028…
$ hscode           <chr> "630630", "630630", "470710", "470710", "470710", "47…
$ weightkg         <int> 4780, 6125, 10855, 11250, 11165, 11290, 9000, 19490, …
$ volumeteu        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ valueofgoodsusd  <dbl> NA, NA, NA, NA, NA, NA, 87110, 188140, NA, 221110, 58…
$ valueofgoods_omu <dbl> 141015, 141015, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ weekday          <ord> Sunday, Monday, Monday, Wednesday, Monday, Monday, We…
$ monthyear        <yearmon> Feb 2034, Mar 2034, Feb 2028, Feb 2028, Sep 2028,…
$ year             <dbl> 2034, 2034, 2028, 2028, 2028, 2028, 2028, 2028, 2028,…

summary was used to get a sense of data distribution. Since volumeteu, valuesofgoods_usd and valueofgoods_omu mostly contain 0 or NA values, we can disregard these columns in the analysis. We also observed that Friday seems to have the highest number of shipments while Wednesday has the lowest number of shipments.

Show code
summary(MC2_edges)
                                        source       
 Fresh Limited Liability Company           : 152521  
 Uttar Pradesh s CJSC                      : 116151  
 bǐ mù yú Sagl Distribution                :  89207  
 Coastal Cruisers Pic Shipping             :  82010  
 Chhattisgarh   Marine ecology A/S Delivery:  81212  
 Mar y Luna Sagl                           :  76853  
 (Other)                                   :4560979  
                              target         arrivaldate        
 hǎi dǎn Corporation Wharf       : 448746   Min.   :2028-01-01  
 Caracola del Sol Services       : 190708   1st Qu.:2029-09-16  
 Costa de la Felicidad Shipping  : 160527   Median :2031-05-06  
 Panope Limited Liability Company: 158081   Mean   :2031-06-03  
 Saltwater Supreme ОАО Forwading : 157800   3rd Qu.:2033-02-26  
 Mar del Este CJSC               : 126180   Max.   :2034-12-30  
 (Other)                         :3916891                       
    hscode             weightkg           volumeteu      valueofgoodsusd    
 Length:5158933     Min.   :        0   Min.   :   0.0   Min.   :0.000e+00  
 Class :character   1st Qu.:     3475   1st Qu.:   0.0   1st Qu.:2.843e+04  
 Mode  :character   Median :    11070   Median :   0.0   Median :7.385e+04  
                    Mean   :    38794   Mean   :   1.5   Mean   :8.716e+05  
                    3rd Qu.:    19960   3rd Qu.:   0.0   3rd Qu.:1.601e+05  
                    Max.   :495492485   Max.   :1215.0   Max.   :2.258e+11  
                                        NA's   :412222   NA's   :2779774    
 valueofgoods_omu        weekday         monthyear         year     
 Min.   :    1100   Sunday   :788160   Min.   :2028   Min.   :2028  
 1st Qu.:  148130   Monday   :745243   1st Qu.:2030   1st Qu.:2029  
 Median :  504485   Tuesday  :675274   Median :2031   Median :2031  
 Mean   : 1665142   Wednesday:601902   Mean   :2031   Mean   :2031  
 3rd Qu.: 1202560   Thursday :691915   3rd Qu.:2033   3rd Qu.:2033  
 Max.   :44744530   Friday   :866322   Max.   :2035   Max.   :2034  
 NA's   :5158652    Saturday :790117                                

Filter out irrelevant edges

We use n_distinct() to check the number of hscode available in the dataset. There are many hscode values that are not relevant to our project. For example, hscode 470710 is for shipment of paper or paperboard. hscode that are relevant to our fish data will include only those that starts with:

  • “301”: Live Fish

  • “302: Fresh or chilled fish exclude fish fillets and fish meat in 304

  • “303”: Frozen fish

  • “304”: Fish fillets and fish meat (fresh, chilled or frozen)

  • “305”: Fish, dried, salted or in brine

  • “1604”: Extracts and juices of meat, fish or crustaceans molluscs or other aquatic invertebrates

  • “1605”: Prepared or preserved fish

we use substr() to extract the first 3 digits from hscode and then filter rows with only the hscode that starts with 301, 302, 303, 304, 305. We also extract the first 4 digits from hscode and filter those with 1604 and 1605.

Show code
n_distinct(MC2_edges$hscode)
[1] 4748
Show code
MC2_edges_filtered<-MC2_edges %>% 
  mutate(sub_hscode3=substr(hscode,1,3)) %>%
  mutate(sub_hscode4=substr(hscode,1,4)) %>% 
  filter(sub_hscode3 %in% c("301","302","303","304","305")|sub_hscode4 %in% c("1604","1605")) 

MC2_edges_filtered$hscode<-as.factor(MC2_edges_filtered$hscode)

summary(MC2_edges_filtered$hscode)
 304620  160414  304610  160521  304710  304750  304810  160510  160529  304870 
  87327   81628   36540   30238   27165   23911   21418   20530   14274   11508 
 303890  304720  304890  304990  160540  160420  303230  304740  303660  160413 
  10984    9714    9597    9444    9199    8977    8612    8505    7370    7148 
 303120  303130  160553  304790  304830  160419  304820  305320  160411  303630 
   6972    6690    4758    4530    3720    3007    2675    2192    2111    1960 
 303110  160556  305420  303310  304940  160415  303540  305410  160416  160551 
   1883    1871    1785    1634    1588    1446    1385    1248    1246    1225 
 160554  303290  303640  304390  305510  303420  303510  303140  160417  303530 
   1175    1072    1049     751     751     730     721     700     690     615 
 305630  305590  303690  303390  303330  304840  303670  303910  160559  304910 
    612     592     588     410     396     384     383     363     357     355 
 303590  303240  303830  304850  160552  304950  160555  160412  160558  305200 
    324     317     316     280     260     236     218     201     176     150 
 160432  303810  304690  303410  303430  304420  305490  301190  304490  303550 
    145     140     137     128     127     117     112     111     107     105 
 303570  160530  305100  305530  303250  305620  305690  302460  303820  305540 
    100      88      86      79      78      78      78      74      68      63 
 301110  303840  160557  303190  304410  304460  304320  305790  305390 (Other) 
     62      62      59      58      54      50      48      48      43     807 

Identify key source and target nodes

Top 10% of the source with the highest number of shipments account for 80% of the total shipments

Show code
Top_sources_tw<-MC2_edges_filtered %>%
  group_by(source) %>% 
  summarise(weight_total=n()) %>% 
  arrange(weight_total)%>% 
  mutate(cm_weight_pct=100*cumsum(weight_total)/sum(weight_total)) %>% 
  mutate(rec=1,
         cm_pop_pct=100*cumsum(rec)/sum(rec))


Top_sources_tw %>%  
  ggplot(aes(cm_pop_pct,cm_weight_pct))+
  geom_line()+
  scale_x_continuous(name="Percent of population", limits=c(0,100), breaks=c(0,20,40,60,80,100))+
  scale_y_continuous(name="Cumulative weight percentage", limits=c(0,100), breaks=c(0,20,40,60,80,100))+
  ggtitle("Cumulative Percentage of Number of Shipments vs Percent of Population")+
  theme_minimal()

Top 20% of the top sources by distinct number of target ship to 70% of the targets

Show code
Top_sources_dist<-MC2_edges_filtered %>%
  group_by(source) %>% 
  summarise(weight_distinct=n_distinct(target)) %>% 
  arrange(weight_distinct)%>% 
  mutate(cm_weight_pct=100*cumsum(weight_distinct)/sum(weight_distinct)) %>% 
  mutate(rec=1,
         cm_pop_pct=100*cumsum(rec)/sum(rec))


Top_sources_dist %>%  
  ggplot(aes(cm_pop_pct,cm_weight_pct))+
  geom_line()+
  scale_x_continuous(name="Percent of population", limits=c(0,100), breaks=c(0,20,40,60,80,100))+
  scale_y_continuous(name="Cumulative distinct weight percentage", limits=c(0,100), breaks=c(0,20,40,60,80,100))+
    ggtitle("Cumulative Percentage of Number of Distinct Targets vs Percent of Population")+
  theme_minimal()

We filter the edge information to contain only the top sources that have not only highest number of total connections but also highest number of unique connections. We extracted 284 top sources based on this method.

Show code
id1<-Top_sources_tw %>% 
  filter(cm_pop_pct>95) %>% 
  select(source)

id2<-Top_sources_dist %>% 
  filter(cm_pop_pct>95) %>% 
  select(source)

top_sources<-merge(id1,id2, by='source')

MC2_edges_filtered<-merge(top_sources, MC2_edges_filtered, by='source')

Similar analysis is performed to identify the top targets that account for the majority of the weight of the graph as well as have the highest distinct number of sources.

Show code
Top_targets_tw<-MC2_edges_filtered %>%
  group_by(target) %>% 
  summarise(weight_total=n()) %>% 
  arrange(weight_total)%>% 
  mutate(cm_weight_pct=100*cumsum(weight_total)/sum(weight_total)) %>%   mutate(rec=1,
         cm_pop_pct=100*cumsum(rec)/sum(rec))


Top_targets_tw %>%  
  ggplot(aes(cm_pop_pct,cm_weight_pct))+
  geom_line()+
  scale_x_continuous(name="Percent of population", limits=c(0,100), breaks=c(0,20,40,60,80,100))+
  scale_y_continuous(name="Cumulative weight percentage", limits=c(0,100), breaks=c(0,20,40,60,80,100))+
  ggtitle("Cumulative Percentage of Number of Shipments vs Percent of Population")+
  theme_minimal()

Show code
Top_targets_dist<-MC2_edges_filtered %>%
  group_by(target) %>% 
  summarise(weight_distinct=n_distinct(source)) %>% 
  arrange(weight_distinct)%>% 
  mutate(cm_weight_pct=100*cumsum(weight_distinct)/sum(weight_distinct)) %>% 
  mutate(rec=1,
         cm_pop_pct=100*cumsum(rec)/sum(rec))


Top_targets_dist %>%  
  ggplot(aes(cm_pop_pct,cm_weight_pct))+
  geom_line()+
  scale_x_continuous(name="Percent of population", limits=c(0,100), breaks=c(0,20,40,60,80,100))+
  scale_y_continuous(name="Cumulative distinct weight percentage", limits=c(0,100), breaks=c(0,20,40,60,80,100))+
    ggtitle("Cumulative Percentage of Number of Distinct Targets vs Percent of Population")+
  theme_minimal()

Show code
id3<-Top_targets_tw %>% 
  filter(cm_pop_pct>97.5) %>% 
  select(target)

id4<-Top_targets_dist %>% 
  filter(cm_pop_pct>85) %>% 
  select(target)

top_targets<-merge(id3,id4, by='target')

MC2_edges_st<-merge(top_targets, MC2_edges_filtered, by='target')

We aggregate the weight of the filtered edges as well as the total weight in kg of the shipments based on source, target, hscode, weekday and monthyear.We will use this data for temporal analysis

Show code
MC2_edges_aggregated<-MC2_edges_filtered %>%
  group_by(source, target, hscode, weekday, monthyear,year) %>% 
  summarise(weight=n(), 
            weightkg=sum(weightkg)) %>%
  mutate(hscode=as.factor(hscode)) %>% 
  ungroup()
summary(MC2_edges_aggregated)
                                                 source     
 Sea Breezes S.A. de C.V. Freight                   : 2753  
 Estrella de la Costa SRL                           : 1813  
 OceanSavor Surf Limited Liability Company Logistics: 1677  
 nián yú Ltd. Corporation                           : 1577  
 Shou gan  Oyj Overseas                             : 1420  
 Kerala  Market - Ges.m.b.H. Cargo                  : 1291  
 (Other)                                            :85459  
                            target          hscode           weekday     
 Mar del Este CJSC             : 9299   304610 :16822   Sunday   :14821  
 hǎi dǎn Corporation Wharf     : 7794   160414 : 9302   Monday   :13608  
 Caracola del Sol Services     : 4536   304620 : 7872   Tuesday  :13694  
 Pao gan SE Seal               : 3731   304810 : 7761   Wednesday:10803  
 Costa de la Felicidad Shipping: 3686   303230 : 4798   Thursday :12317  
 Olas del Sur Ltd              : 3230   304710 : 4747   Friday   :15981  
 (Other)                       :63714   (Other):44688   Saturday :14766  
   monthyear         year          weight           weightkg      
 Min.   :2028   Min.   :2028   Min.   :  1.000   Min.   :      0  
 1st Qu.:2030   1st Qu.:2029   1st Qu.:  1.000   1st Qu.:  21390  
 Median :2032   Median :2031   Median :  2.000   Median :  37220  
 Mean   :2032   Mean   :2031   Mean   :  2.798   Mean   :  63838  
 3rd Qu.:2033   3rd Qu.:2033   3rd Qu.:  2.000   3rd Qu.:  51960  
 Max.   :2035   Max.   :2034   Max.   :160.000   Max.   :3293520  
                                                                  

To plot the overall network graph, we will further aggregate this data by source, target and year. As hscode 304620 has the highest count, we will also look at the graph in with this hscode.

Show code
MC2_edges_graph<-MC2_edges_aggregated%>%
  filter(hscode=='304620'& (year=='2028'| year=='2034')) %>% 
  group_by(source, target, year) %>% 
  summarise(weight=n()) %>%
  ungroup()

Data Visualization

Visualize the main graph

Show code
MC2_graph<-as_tbl_graph(MC2_edges_graph,
                        directed = T)

We can see that the number of edges for nodes at the boundary of the graph decreases from 2028 to 2034. Since our graph is directed, this means that there are fewer shipments between the weakly connected nodes. The nodes with high degree centrality for this type of shipments do not change significantly from 2028 to 2034.

Show code
# |fig-width: 100 
# |fig-height: 100 
MC2_graph %>%
  mutate(centrality = centrality_authority()) %>% 
  ggraph(layout = "fr")+   
  geom_edge_link(aes(width=weight), 
                 alpha=0.2)+
  scale_edge_width(range = c(0.1, 5))+
  geom_node_point(aes(size = centrality, colour = centrality))+   
  theme_graph()+
  scale_color_continuous(guide = 'legend') +
  theme_graph()+
  facet_edges(~year)+
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

Next we will use visnetwork to create an interactive graph and zoom into the high in degree centrality nodes. We can identify a few companies that have the highest degree centrality: Caracola del Sol Services, hǎi dǎn Corporation Wharf, and Mar del Este CJSC

Show code
edges_df<-MC2_graph%>% 
  activate(edges) %>% 
  as.tibble()
Show code
nodes_df<-MC2_graph%>%
  mutate(centrality = centrality_authority()) %>% 
  activate(nodes) %>% 
  as.tibble() %>%
  rename(label=name) %>% 
  mutate(id=row_number()) %>%
  select(id, label, centrality) 
Show code
visNetwork(nodes=nodes_df,edges=edges_df)%>%    
  visIgraphLayout(layout = "layout_with_fr") %>%  
  visEdges(arrows = "to", 
           smooth = list(enabled = TRUE,              
                         type = "curvedCW")) %>%
  visLegend() %>%
  visLayout(randomSeed=123) %>%    
  visOptions(highlightNearest = TRUE, 
             nodesIdSelection = TRUE)

We change the arrows option to from to identify the sources with high outdegree centrality. There are more companies with high out degree centrality such as Sea Breezes S.A. de C.V. Freight, Arena del Sol Pic.
It is noteworthy that the high indegree source Caracola del Sol Services identified in the previous link also have a high outdegree centrality. We can take a closer look at the network and the temporal trend of this company.

Show code
visNetwork(nodes=nodes_df,edges=edges_df)%>%    
  visIgraphLayout(layout = "layout_with_fr") %>%  
  visEdges(arrows = "from", 
           smooth = list(enabled = TRUE,              
                         type = "curvedCW")) %>%
  visLegend() %>%
  visLayout(randomSeed=123) %>%    
  visOptions(highlightNearest = TRUE, 
             nodesIdSelection = TRUE)

Community Dectection

We used group_louvain() which implements the multi-level modularity optimization algorithm for finding community structure. It is based on the modularity measure and a hierarchical approach. Initially, each vertex is assigned to a community on its own. In every step, vertices are re-assigned to communities in a local, greedy way: each vertex is moved to the community with which it achieves the highest contribution to modularity. When no vertices can be reassigned, each community is considered a vertex on its own, and the process starts again with the merged communities. The process stops when there is only a single vertex left or when the modularity cannot be increased any more in a step.

Show code
graph<-as_tbl_graph(MC2_edges_graph, directed=F)
graph<-graph%>%
  mutate(community = as.factor(group_louvain())) 
Show code
edges<-graph %>% 
  activate(edges) %>% 
  as.tibble()

nodes<-graph %>% 
  activate(nodes) %>% 
  as.tibble() %>% 
  rename(label=name) %>% 
  mutate(id=row_number()) %>% 
  rename(group=community)
Show code
visNetwork(nodes,
           edges) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visLegend() %>%
  visLayout(randomSeed = 123) %>% 
  visOptions(highlightNearest = TRUE, 
             nodesIdSelection = TRUE)

Temporal analysis

We see a dip in the overall trend of the shipment volume from 2028 to 2031 and then an increase from 2031 to 2034

Show code
overall_trend <-MC2_edges_filtered%>%
  group_by(monthyear) %>%
  summarise(weight=n(),
            weightkg=mean(weightkg)) %>%
  arrange(monthyear) %>% 
  ungroup()

t<-overall_trend %>% 
  ggplot(aes(x = monthyear, y=weightkg))+
  geom_point()+
  geom_line()+
  theme(legend.position="none")+
  theme_minimal()+
  ggtitle('Overall trend in the monthly average shipment volume in kg over time')+
  scale_y_continuous(name="Average shipment volume in kg")+
  scale_x_continuous(name="Month Year", breaks=c(2028,2029,2030,2031,2032,2033,2034,2035))+
  theme_minimal()

ggplotly(t,
         tooltip=c("x","y"))

We observed the same trend the the monthly total number of shipments over time.

Show code
t2<-overall_trend %>% 
  ggplot(aes(x = monthyear, y=weight))+
  geom_point()+
  geom_line()+
  theme(legend.position="none")+
  theme_minimal()+
  ggtitle('Overall trend in the monthly total number of shipments over time')+
  scale_y_continuous(name="No of shipments")+
  scale_x_continuous(name="Month Year", breaks=c(2028,2029,2030,2031,2032,2033,2034,2035))+
  theme_minimal()

ggplotly(t,
         tooltip=c("x","y"))

Now we will take a closer look at the temporal shipment pattern of one of key node in shipments of hscode ‘304620’

Show code
key_node_28 <-MC2_edges_filtered%>%
  filter(monthyear<'Jan 2029'&
        (source=='Caracola del Sol Services' | target=='Caracola del Sol Services')) %>%
  mutate(monthyear=as.factor(monthyear)) %>% 
  group_by(weekday,monthyear) %>%
  summarise(weight=n(),
            weightkg=mean(weightkg)) %>%
  arrange(weekday,monthyear) %>% 
  ungroup()

key_node_28 %>% 
  ggplot(aes(monthyear,weekday, fill=weight))+
geom_tile(colour='white',
          size=0.1)+ 
scale_fill_gradient(name = "No of shipments",
                    low = "sky blue", 
                    high = "dark blue") +
   theme_minimal()+
labs(x = NULL, 
     y = NULL, 
     title = "No of shipments by month year and weekday in 2028") +
theme(axis.text.x=element_text(angle=30, vjust=0.5),
      plot.title = element_text(hjust = 0.5),
      legend.title = element_text(size = 8),
      legend.text = element_text(size = 6) )

Show code
key_node_34 <-MC2_edges_filtered%>%
  filter(monthyear>='Jan 2034'&
        (source=='Caracola del Sol Services' | target=='Caracola del Sol Services')) %>% 
  mutate(monthyear=as.factor(monthyear)) %>% 
  group_by(weekday,monthyear) %>%
  summarise(weight=n(),
            weightkg=mean(weightkg)) %>%
  arrange(weekday,monthyear) %>% 
  ungroup()

key_node_34 %>% 
  ggplot(aes(monthyear,weekday, fill=weight))+
geom_tile(colour='white',
          size=0.1)+ 
scale_fill_gradient(name = "No of shipments",
                    low = "sky blue", 
                    high = "dark blue") +
   theme_minimal()+
labs(x = NULL, 
     y = NULL, 
     title = "No of shipments by month year and weekday in 2034") +
theme(axis.text.x=element_text(angle=30, vjust=0.5),
      plot.title = element_text(hjust = 0.5),
      legend.title = element_text(size = 8),
      legend.text = element_text(size = 6) )

We will plot the percentage change in the monthly total number of shipments over time by individual source to identify any abnormal increase or decrease in the number of shipments. From the graph, we can get the name of the company from the tooltip function.

Show code
source_trend<-MC2_edges_filtered%>%
  group_by(source, monthyear) %>%
  summarise(weight=n(),
            weightkg=sum(weightkg)) %>%
  arrange(monthyear) %>% 
  ungroup()
Show code
source_trend<-source_trend %>% 
  group_by(source) %>%
  mutate(weightkg_pct_change=round((weightkg/lag(weightkg)-1)*100,0)) %>% 
  mutate(weight_pct_change=round((weight/lag(weight)-1)*100,0)) %>% 
  na.omit()
Show code
st<-ggplot(source_trend, aes(x = monthyear, y = weight_pct_change,
             colour=source))+
  geom_point()+
  geom_line()+  
  ggtitle('Percentage change of the monthly total number of shipments over time shipped by each source')+
  scale_y_continuous(name="Percentage change of number of shipments")+
  scale_x_continuous(name="Month Year", breaks=c(2028,2029,2030,2031,2032,2033,2034,2035))+
  theme_minimal()+
  theme(legend.position="none")
  

ggplotly(st,
         tooltip=c('colour',
         "x",
         "y"))

We will plot the percentage change in the monthly total volume of shipments over time by individual source to identify any abnormal increase or decrease in the volume of shipments. From the graph, we can identify these companies with the tooltip function.

Show code
stkg<-ggplot(source_trend, aes(x = monthyear, y = weightkg_pct_change,
             colour=source))+
  geom_point()+
  geom_line()+  
  ggtitle('Percentage change of volume in kg of shipments over time shipped by each source')+
  scale_y_continuous(name="Percentage change of volume in kg of shipments")+
  scale_x_continuous(name="Month Year", breaks=c(2028,2029,2030,2031,2032,2033,2034,2035))+
  theme_minimal()+
  theme(legend.position="none")
  

ggplotly(stkg,
         tooltip=c('colour',
         "x",
         "y"))

We will perform the same analysis for the targets.

Show code
target_trend<-MC2_edges_filtered%>%
  group_by(target, monthyear) %>%
  summarise(weight=n(),
            weightkg=sum(weightkg)) %>%
  arrange(monthyear) %>% 
  ungroup()
Show code
target_trend<-target_trend %>% 
  group_by(target) %>%
  mutate(weightkg_pct_change=round((weightkg/lag(weightkg)-1)*100,0)) %>% 
  mutate(weight_pct_change=round((weight/lag(weight)-1)*100,0)) %>%
  na.omit()
Show code
twkg<-ggplot(target_trend, aes(x = monthyear, y = weightkg_pct_change,
             colour=target))+
  geom_point()+
  geom_line()+  
  ggtitle('Percentage change of volume in kg of shipments over time received by each target')+
  scale_y_continuous(name="Percentage change of volume in kg of shipments")+
  scale_x_continuous(name="Month Year", breaks=c(2028,2029,2030,2031,2032,2033,2034,2035))+
  theme_minimal()+
  theme(legend.position="none")
  

ggplotly(twkg,
         tooltip=c('colour',
         "x",
         "y"))
Show code
tw<-ggplot(target_trend, aes(x = monthyear, y = weight_pct_change,
             colour=target))+
  geom_point()+
  geom_line()+  
  ggtitle('Percentage change of total number of shipments over time received by each target')+
  scale_y_continuous(name="Percentage change of number of shipments")+
  scale_x_continuous(name="Month Year", breaks=c(2028,2029,2030,2031,2032,2033,2034,2035))+
  theme_minimal()+
  theme(legend.position="none")
  

ggplotly(tw,
         tooltip=c('colour',
         "x",
         "y"))